Among all cancer types, lung cancer is the most lethal among both males and females. It is responsible for 28% of cancer-related mortality. The life expectation for those who have lung cancer are very poor. It is usually diagnosed in an advanced phase, and the patients only have a 16% survival rate after 5 years of their diagnosis.
The Cancer Genome Atlas (TCGA) has comprehensively profiled this type of cancer in a patient cohort. Here, we analyze the expression profiles of those patients, using a pipeline based on the R/Bioconductor software package Rsubread.
This document is written in R markdown and
should be processed using R and you need to install the packages
knitr and markdown. Moreover, it using the official style for Bioconductor vignettes facilitated by the Bioconductor package BiocStyle. It is compiled from different files following the instructions included in the makefile.
The assessment and analysis of lung squamous cell carcinoma starts by importing the raw table of counts.
library(SummarizedExperiment)
se.unpaired <- readRDS(file.path("rawCounts", "seLUSC.rds"))
se.unpaired
class: RangedSummarizedExperiment
dim: 20115 553
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(553): TCGA.18.3406.01A.01R.0980.07
TCGA.18.3407.01A.01R.0980.07 ... TCGA.90.7767.11A.01R.2125.07
TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
table(se.unpaired$type)
normal tumor
51 502
Information can already be obtained. For example, that the file contains 20115 genes from a total of 553 samples. Of this samples, 502 are tumor samples and 51 are normal samples, which matches the information of the dataset from The Cancer Genome Atlas (TCGA) program.
Next, the column (phenotypic) data can be explored, which in this case corresponds to clinical variables and their corresponding metadata. For example, the column “type” consists of tumor and normal labels. Other information such as gender, tumor stage, and smoking status are also shown.
In any case, all the possible labels can be accessed with the last line of code of the code block below, which returns two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be use in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search.
dim(colData(se.unpaired))
[1] 553 549
colData(se.unpaired)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
type bcr_patient_uuid
<factor> <factor>
TCGA.18.3406.01A.01R.0980.07 tumor 95b83006-02c9-4c4d-bf84-a45115f7d86d
TCGA.18.3407.01A.01R.0980.07 tumor 4e1ad82e-23c8-44bb-b74e-a3d0b1126b96
TCGA.18.3408.01A.01R.0980.07 tumor d4bc755a-2585-4529-ae36-7e1d88bdecfe
TCGA.18.3409.01A.01R.0980.07 tumor b09e872a-e837-49ec-8a27-84dcdcabf347
TCGA.18.3410.01A.01R.0980.07 tumor 99599b60-4f5c-456b-8755-371b1aa7074e
bcr_patient_barcode form_completion_date
<factor> <factor>
TCGA.18.3406.01A.01R.0980.07 TCGA-18-3406 2011-3-9
TCGA.18.3407.01A.01R.0980.07 TCGA-18-3407 2011-3-9
TCGA.18.3408.01A.01R.0980.07 TCGA-18-3408 2011-3-9
TCGA.18.3409.01A.01R.0980.07 TCGA-18-3409 2011-3-17
TCGA.18.3410.01A.01R.0980.07 TCGA-18-3410 2011-4-4
prospective_collection
<factor>
TCGA.18.3406.01A.01R.0980.07 NO
TCGA.18.3407.01A.01R.0980.07 NO
TCGA.18.3408.01A.01R.0980.07 NO
TCGA.18.3409.01A.01R.0980.07 NO
TCGA.18.3410.01A.01R.0980.07 NO
mcols(colData(se.unpaired), use.names=TRUE)
DataFrame with 549 rows and 2 columns
labelDescription
<character>
type sample type (tumor/normal)
bcr_patient_uuid bcr patient uuid
bcr_patient_barcode bcr patient barcode
form_completion_date form completion date
prospective_collection tissue prospective collection indicator
... ...
lymph_nodes_pelvic_pos_total total pelv lnp
lymph_nodes_aortic_examined_count total aor lnr
lymph_nodes_aortic_pos_by_he aln pos light micro
lymph_nodes_aortic_pos_by_ihc aln pos ihc
lymph_nodes_aortic_pos_total total aor-lnp
CDEID
<character>
type NA
bcr_patient_uuid NA
bcr_patient_barcode 2673794
form_completion_date NA
prospective_collection 3088492
... ...
lymph_nodes_pelvic_pos_total 3151828
lymph_nodes_aortic_examined_count 3104460
lymph_nodes_aortic_pos_by_he 3151832
lymph_nodes_aortic_pos_by_ihc 3151831
lymph_nodes_aortic_pos_total 3151827
The row (feature) data provides information on genes. For the first line of command, the length of each gene and the GC content can be seen. The second line of command provides further information, such as the ranges within individual chromosome.
rowData(se.unpaired)
DataFrame with 20115 rows and 3 columns
symbol txlen txgc
<character> <integer> <numeric>
1 A1BG 3322 0.564419024683925
2 A2M 4844 0.488232865400495
9 NAT1 2280 0.394298245614035
10 NAT2 1322 0.389561270801815
12 SERPINA3 3067 0.524942940984676
... ... ... ...
100996331 POTEB 1706 0.430832356389215
101340251 SNORD124 104 0.490384615384615
101340252 SNORD121B 81 0.407407407407407
102724473 GAGE10 538 0.505576208178439
103091865 BRWD1-IT2 1028 0.592412451361868
rowRanges(se.unpaired)
GRanges object with 20115 ranges and 3 metadata columns:
seqnames ranges strand | symbol txlen
<Rle> <IRanges> <Rle> | <character> <integer>
1 chr19 58345178-58362751 - | A1BG 3322
2 chr12 9067664-9116229 - | A2M 4844
9 chr8 18170477-18223689 + | NAT1 2280
10 chr8 18391245-18401218 + | NAT2 1322
12 chr14 94592058-94624646 + | SERPINA3 3067
... ... ... ... . ... ...
100996331 chr15 20835372-21877298 - | POTEB 1706
101340251 chr17 40027542-40027645 - | SNORD124 104
101340252 chr9 33934296-33934376 - | SNORD121B 81
102724473 chrX 49303669-49319844 + | GAGE10 538
103091865 chr21 39313935-39314962 + | BRWD1-IT2 1028
txgc
<numeric>
1 0.564419024683925
2 0.488232865400495
9 0.394298245614035
10 0.389561270801815
12 0.524942940984676
... ...
100996331 0.430832356389215
101340251 0.490384615384615
101340252 0.407407407407407
102724473 0.505576208178439
103091865 0.592412451361868
-------
seqinfo: 455 sequences (1 circular) from hg38 genome
In this analysis, a comparison between tumor and normal cells was done. For this, the dataset is subsetted to obtain only those samples which come from the same patient but have different type, setting a paired design. This strategy also has the benefit of reducing the possibility of false positives. In order to get them, the following code is executed.
# get the number of occurences of each patient
occur <- data.frame(table(substr(colnames(se.unpaired), 9, 12)))
# get those that occur twice
paired_df <- occur[occur$Freq > 1,]
paired <- as.vector(paired_df$Var1)
paired <- paired[2:length(paired)]
mask <- is.element(substr(colnames(se.unpaired), 9, 12), paired)
se <- se.unpaired[ ,mask]
saveRDS(se, file.path("results", "se.paired.rds"))
Since this data is unprocessed, a first step of quality assessment and normalization must be performed. To do this, the edgeR R/Bioconductor package is loaded. This requires the creation of a DGEList' object, as the package does not work directly withSummarizedExperiment` objects. In any case, updates to one of the objects will be reflected in the other.
library(edgeR)
dge <- DGEList(counts=assays(se)$counts, genes=mcols(se))
saveRDS(dge, file.path("results", "dge.paired.rds"))
One way of normalizing is to use the counts per million (CPM) values. This value is calculated by dividing the number of counts of each sample by 1 million. Instead of using this directly, the \(\log_2\) CPM values of expression are calculated as they have nicer distributional properties than raw counts or non-logged CPM units. This information is stored as an additional assay element to ease their manipulation.
assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(se)$logCPM[1:5, 1:5]
TCGA.22.4593.01A.21R.1820.07 TCGA.22.4609.01A.21R.2125.07
1 2.433139 1.760377
2 9.013463 10.810625
9 -6.817220 -6.817220
10 -6.817220 -6.817220
12 5.605909 5.647356
TCGA.22.5471.01A.01R.1635.07 TCGA.22.5472.01A.01R.1635.07
1 2.122666 0.775287
2 7.713750 11.100735
9 -6.817220 -6.817220
10 -6.817220 -6.817220
12 4.823947 5.856939
TCGA.22.5478.01A.01R.1635.07
1 3.713844
2 9.448174
9 -6.817220
10 -6.817220
12 4.920820
First, the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample can be examined. Figure 1 below shows the sequencing depth per sample, also known as library size, in increasing order.
Figure 1: Library sizes in increasing order
This figure reveals no big changes in sequencing depth between samples. In addition, a uniform distribution of tumor and normal samples is seen across the figure. There is a cluster of normal samples on the right side with a higher CPM, which is not considered as problematic for further analyses. The identifiers of those samples with lower sequencing depth can be obtained using the commands below.
sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(se), 6, 12)
sort(sampledepth)
43.6773 56.7823 77.7335 56.8309 56.8083 58.8386 77.7335 43.6773 56.7582
28.8 29.0 31.0 32.0 32.9 33.4 34.9 35.0 35.9
56.8201 56.8201 56.8623 34.8454 77.7142 56.7579 56.7730 56.8083 56.8082
36.5 37.0 37.0 37.6 38.7 39.7 40.2 40.3 40.4
56.8082 51.4081 34.8454 22.5481 56.8309 77.7337 33.4587 56.7823 56.7580
40.7 41.9 42.1 42.3 43.6 44.0 45.6 45.7 45.8
56.7579 85.7710 58.8386 34.7107 56.7731 22.5471 77.7142 56.7580 56.7731
45.9 46.5 46.6 47.3 47.6 47.8 48.4 49.7 50.0
85.7710 43.3394 39.5040 22.5472 60.2709 51.4079 90.6837 51.4080 77.7338
50.1 50.2 51.0 51.1 51.2 51.5 51.6 51.9 52.0
56.7222 77.7338 77.7138 56.7582 22.5471 43.5670 22.4609 22.5483 92.7340
52.4 52.4 52.8 52.9 53.6 53.6 54.2 54.4 54.6
77.8008 34.7107 90.7767 43.7657 22.4609 43.7657 77.7138 33.6737 43.7658
54.9 55.1 55.7 55.8 56.3 56.7 57.0 57.2 57.4
77.7337 77.8007 43.5670 56.7222 56.8623 33.4587 43.7658 43.6143 22.5481
57.6 57.6 58.0 58.0 59.1 59.6 59.9 60.0 60.3
22.5491 43.6647 92.7340 39.5040 22.5489 22.5482 77.8008 22.5489 22.5491
61.0 61.3 61.3 61.5 63.5 64.5 65.2 65.5 66.0
77.8007 22.5478 90.6837 90.7767 56.7730 33.6737 60.2709 22.5478 43.6143
66.4 68.0 68.2 68.4 68.8 69.3 69.8 71.0 75.9
22.5472 22.5482 22.4593 43.6771 22.5483 43.6647 22.4593 43.6771 51.4081
76.2 77.0 77.9 79.4 80.3 81.5 85.0 88.8 103.7
51.4080 51.4079 43.3394
119.5 122.6 124.1
Next step is to look at the distribution of expression values per sample in terms of logarithmic CPM units. Tumor and normal samples are displayed separately, and are shown in Figure 2. Each colored line in the graphs below represent a sample. Both graphs have two modes: the first represents genes not expressed in that sample, while the second represents the genes that are expressed.
Figure 2: Non-parametric density distribution of expression profiles per sample
In both graphs, some genes deviate a little from the rest, so they are noted as potential outliers.
To look at the distribution of expression levels across genes, the average expression per gene through all the samples is calculated. Figure 3 shows the distribution of those values.
Figure 3: Distribution of average expression level per gene
RNA sequence expression profiles that come from lowly-expressed genes can lead to artifacts in downstream differential expression analyses. Thus, it is common to set a threshold and filter out genes that fall below this value.
In the light of the plot above (Figure 3), a cutoff of 1 log CPM unit is considered as the minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes.
mask <- avgexp > 1
dim(se)
[1] 20115 102
se.filt <- se[mask, ]
dim(se.filt)
[1] 11872 102
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 11872 102
A this point, there are a total of 20115 genes and 553 samples. The un-normalized versions of the filtered expression data are stored.
saveRDS(se.filt, file.path("results", "se.paired.gfilt.unnorm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.gfilt.unnorm.rds"))
Next, the normalization factors on the filtered expression data set using the calcNormFactors function are calculated.
dge.filt <- calcNormFactors(dge.filt)
Then, the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object are replaced by the normalized ones.
assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)
Finally, the normalized versions of the filtered expression data are stored.
saveRDS(se.filt, file.path("results", "se.paired.gfilt.norm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.gfilt.norm.rds"))
MA-plots of the normalized expression profiles allow us to visualize how one sample compares to the average of the rest of the samples, therefore assessing its quality. Ttumor samples are in Figure 4.
Figure 4: MA-plots of the tumor samples
The MA-plots of the tumor samples provide the following insight:
The appearance of samples that differ from the mean can be problematic. It is worth identifying the name of these samples to keep track of them in subsequent quality assessments. If they probe to be more problematic in downstream analysis, those samples should be removed together with their normal pairs.
big_c
[1] "TCGA.56.7823" "TCGA.56.8623" "TCGA.77.7138"
The normal samples are in Figure 5.
Figure 5: MA-plots of the normal samples
In this case, the observations are:
In conclusion, either important expression-level dependent biases are not observed among the normal samples.
Batch effects are sub-groups of measurements that have a qualitatively different behavior across conditions and are unrelated to the variables in the study. It is important to determine if batch effects are present to know if any confounding variables are affecting the data.
Potential surrogate of batch effect indicators within normal and tumor samples can be obtained. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview, different elements of the TCGA barcode can be derived from it, and examine their distribution across samples.
tss <- substr(colnames(se.filt), 6, 7)
table(data.frame(TYPE=se.filt$type, TSS=tss))
TSS
TYPE 22 33 34 39 43 51 56 58 60 77 85 90 92
normal 10 2 2 1 8 3 12 1 1 7 1 2 1
tumor 10 2 2 1 8 3 12 1 1 7 1 2 1
center <- substr(colnames(se.filt), 27, 28)
table(data.frame(TYPE=se.filt$type, CENTER=center))
CENTER
TYPE 07
normal 51
tumor 51
plate <- substr(colnames(se.filt), 22, 25)
table(data.frame(TYPE=se.filt$type, PLATE=plate))
PLATE
TYPE 0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2247 2296 2326
normal 0 0 5 4 7 1 4 10 10 2 4 2 1
tumor 1 3 6 0 7 0 4 10 10 2 4 2 1
PLATE
TYPE A28V
normal 1
tumor 1
portionanalyte <- substr(colnames(se.filt), 18, 20)
table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
PORTIONANALYTE
TYPE 01R 11R 21R 31R 41R
normal 48 3 0 0 0
tumor 11 28 8 2 2
samplevial <- substr(colnames(se.filt), 14, 16)
table(data.frame(TYPE=se.filt$type, SAMPLEVIAL=samplevial))
SAMPLEVIAL
TYPE 01A 01B 11A
normal 0 0 51
tumor 50 1 0
From this information the following observations can be done:
colnames(se.filt)[samplevial == "01B"]
[1] "TCGA.56.7823.01B.11R.2247.07"
With those results, the portion of analyte is used as a surrogate of batch effect indicator. Considering the outcome of interest as the molecular changes between sample types, tumor vs. normal, the cross-classification of this outcome with portion of analyte are examined.
table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
PORTIONANALYTE
TYPE 01R 11R 21R 31R 41R
normal 48 3 0 0 0
tumor 11 28 8 2 2
It is shown how the majority of normal samples (48) come from the “01R”" analyte and 3 come from the “11R” analyte. However, the tumor samples span across five different analytes. This could potentially lead to a batch effect.
In order to examine how the samples group together, two approaches are used here: hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the surrogate of batch indicator, or portion of analyte in this case.
The approaches are performed under a subset of the samples, as it is desirable to have a matrix without zeros. Samples belonging to portions 21R, 31R and 41R are removed, and as the experiment is paired, also their pairs.
por1 <- as.vector(substr(colnames(se.filt)[portionanalyte == "41R"], 9, 12))
por2 <- as.vector(substr(colnames(se.filt)[portionanalyte == "31R"], 9, 12))
por3 <- as.vector(substr(colnames(se.filt)[portionanalyte == "21R"], 9, 12))
allpor <- c(por1, por2, por3)
table(is.element(substr(colnames(se.filt), 9, 12), allpor))
FALSE TRUE
78 24
se.batch <- se.filt[ ,!is.element(substr(colnames(se.filt), 9, 12), allpor)]
dge.batch <- DGEList(counts=assays(se.batch)$counts, genes=mcols(se.batch))
new.portionanalyte <- substr(colnames(se.batch), 18, 20)
table(data.frame(TYPE=se.batch$type, PORTIONANALYTE=new.portionanalyte))
PORTIONANALYTE
TYPE 01R 11R
normal 36 3
tumor 11 28
dim(dge.batch)
[1] 11872 78
The code above removed 24 samples, which correspond to the 12 in the indicated portions and their pairs.
The log CPM values are calculated again with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 6.
logCPM <- cpm(dge.batch, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(new.portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.batch)
outcome <- paste(substr(colnames(se.batch), 9, 12), as.character(se.batch$type), sep="-")
names(outcome) <- colnames(se.batch)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Figure 6: Hierarchical clustering of the samples
The dendogram shows how samples cluster primarily by sample type, tumor or normal. The different portions are present in both clusters, indicating that they do not induce batch effect. The plot also shows how one of the tumor samples, 8623-tumor, is present in the normal samples cluster. This sample also had a strong deviation in the MA plot.
In Figure 7 the corresponding MDS plot is shown. It can be seen that the first source of variation separates tumor from normal samples, and there is not any sample that is separated from any cluster.
plotMDS(dge.batch, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))),
fill=sort(unique(batch)), inset=0.05)
Figure 7: Multidimensional scaling plot of the samples
Finally, under the consideration that there is no batch effect, the 24 samples that were removed in order to make the dendogram and the MDS plot will remain for further analysis. Then, the dendogram is generated again in Figure 8 in order to see if any other sample appears in its opposite cluster, apart from 8623-tumor.
logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Figure 8: Hierarchical clustering of the samples
Again, only the 8623-tumor sample does not cluster with the rest of the tumor samples. This sample will be removed from the data before further analysis. For this, new files will be saved without those samples.
number <- as.character(8623)
se.filt <- se.filt[ ,!is.element(substr(colnames(se.filt), 9, 12), number)]
dge.filt <- DGEList(counts=assays(se.filt)$counts, genes=mcols(se.filt))
dim(se.filt)
[1] 11872 100
dim(dge.filt)
[1] 11872 100
saveRDS(se.filt, file.path("results", "se.paired.gsfilt.norm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.gsfilt.norm.rds"))
This leaves a number of 11872 and 100 samples, of which 50 are tumoral and 50.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] geneplotter_1.60.0 annotate_1.60.1
[3] XML_3.98-1.20 AnnotationDbi_1.44.0
[5] lattice_0.20-38 edgeR_3.24.3
[7] limma_3.38.3 SummarizedExperiment_1.12.0
[9] DelayedArray_0.8.0 BiocParallel_1.16.6
[11] matrixStats_0.54.0 Biobase_2.42.0
[13] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[15] IRanges_2.16.0 S4Vectors_0.20.1
[17] BiocGenerics_0.28.0 knitr_1.23
[19] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 RColorBrewer_1.1-2 compiler_3.5.3
[4] BiocManager_1.30.4 highr_0.8 XVector_0.22.0
[7] bitops_1.0-6 tools_3.5.3 zlibbioc_1.28.0
[10] bit_1.1-14 digest_0.6.19 memoise_1.1.0
[13] RSQLite_2.1.1 evaluate_0.14 Matrix_1.2-17
[16] DBI_1.0.0 yaml_2.2.0 xfun_0.7
[19] GenomeInfoDbData_1.2.0 stringr_1.4.0 bit64_0.9-7
[22] locfit_1.5-9.1 grid_3.5.3 rmarkdown_1.13
[25] bookdown_0.11 blob_1.1.1 magrittr_1.5
[28] codetools_0.2-16 htmltools_0.3.6 xtable_1.8-4
[31] KernSmooth_2.23-15 stringi_1.4.3 RCurl_1.95-4.12
Differential expression analysis consists of identifying changes in gene expression. For RNAseq experiments, this means comparing relative concentrations of RNA molecules. In R, this can be done by using the limma and sva R/Bioconductor packages.
limma allows for linear model analysis of data for DE. This requires building two model matrices: one including the outcome of interest and adjustment variables, and a matrix of the null model that only includes the adjustment variables. The outcome of interest is the type of sample (tumor vs normal), and as both types come from the same patient the analysis is paired, so a unique identifier for each patient is included.
Even if no known sources of variation were found, surrogate variables representing unknown sources of variation can be added to the model. Those can be estimated with the sva package, and later be appended to the model. Finally, voom is used to fit the model and the moderated t-statistics are calculated. This pipeline is chosen over limma-trend because the library size distribution is not uniform across samples.
library(sva)
library(limma)
# Obtain the patient ID, as the analysis is paired
patientid <- substr(colnames(se.filt), 9, 12)
se.filt$type <- relevel(se.filt$type, ref = "normal")
# Build the model (mod) and null (mod0) matrices
mod <- model.matrix(~type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~patientid, data = colData(se.filt))
# Estimate surrogate variables
sv <- sva(assays(se.filt)$logCPM, mod = mod, mod0 = mod0)
Number of significant surrogate variables is: 12
Iteration (out of 5 ):1 2 3 4 5
# Add surrogates to the design matrix
len <- length(colnames(mod))
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:len], paste0("SV", 1:sv$n))
v <- voom(dge.filt, mod, plot=TRUE)
Figure 9: Mean-variance plot for DE analysis
fit <- lmFit(v, mod) # Fit the model with the voom weights
fit <- eBayes(fit) # Calculate moderated t-statistics
sva was able to estimate 12 surrogate variables. Regarding the mean-variance plot shown in Figure 9, which was obtained with voom, it shows the variance in the y axis and the mean of expression in the x axis for each gene, represented as a black dot. Such plots contain a red curve that is fitted to the dots and show trends in the data. Normally increases in expression levels mean a decrease in the variance, until a plateau is reached for highly expressed genes. This trend is observed in the plot obtained for this data. Moreover, the plot is also a diagnosis of filtering steps performed upstream. The fact that there is not a drop in variance for the lower expression values indicates that the filtering threshold was good.
As multiple tests are being performed, the probability of type I errors (rejecting the null hypotheses when it is true) increases. There are multiple statistical strategies consisting of applying corrections to the p-values to decrease the likelihood of these errors, such as the false discovery rate (FDR). This technique consists of defining an acceptable rate of type I errors (false discoveries). Here, a FDR of 0.01 is set, which means that 1% of the rejections of the null hypothesis will be incorrect. For this data set, which has 11872 observations, up to 119 genes can be false positives (FP).
FDRcutoff <- 0.01
res <- decideTests(fit, p.value = FDRcutoff)
summary(res)
(Intercept) typetumor patientid3394 patientid4079 patientid4080
Down 5 4095 0 18 0
NotSig 784 2735 11870 11851 11871
Up 11083 5042 2 3 1
patientid4081 patientid4587 patientid4593 patientid4609 patientid5040
Down 1 4 1 0 0
NotSig 11870 11867 11869 11871 11871
Up 1 1 2 1 1
patientid5471 patientid5472 patientid5478 patientid5481 patientid5482
Down 0 0 0 5 1
NotSig 11871 11870 11871 11866 11870
Up 1 2 1 1 1
patientid5483 patientid5489 patientid5491 patientid5670 patientid6143
Down 0 1 0 0 1
NotSig 11871 11869 11870 11871 11869
Up 1 2 2 1 2
patientid6647 patientid6737 patientid6771 patientid6773 patientid6837
Down 4 0 1 1 0
NotSig 11867 11868 11869 11869 11871
Up 1 4 2 2 1
patientid7107 patientid7138 patientid7142 patientid7222 patientid7335
Down 1 2 4 0 4
NotSig 11870 11870 11865 11871 11866
Up 1 0 3 1 2
patientid7337 patientid7338 patientid7340 patientid7579 patientid7580
Down 0 1 5 0 3
NotSig 11870 11870 11860 11871 11867
Up 2 1 7 1 2
patientid7582 patientid7657 patientid7658 patientid7710 patientid7730
Down 2 4 4 17 3
NotSig 11867 11866 11868 11851 11867
Up 3 2 0 4 2
patientid7731 patientid7767 patientid7823 patientid8007 patientid8008
Down 5 0 5 4 8
NotSig 11864 11869 11866 11867 11863
Up 3 3 1 1 1
patientid8082 patientid8083 patientid8201 patientid8309 patientid8386
Down 6 0 0 0 0
NotSig 11865 11870 11871 11872 11871
Up 1 2 1 0 1
patientid8454 SV1 SV2 SV3 SV4 SV5 SV6 SV7 SV8 SV9
Down 8 1442 2106 783 778 71 4 36 24 82
NotSig 11863 8792 9315 10848 10350 11628 11618 11791 11825 11740
Up 1 1638 451 241 744 173 250 45 23 50
SV10 SV11 SV12
Down 60 16 1
NotSig 11688 11822 11865
Up 124 34 6
The summary table displays the upregulated, not significant, and downregulated genes for each variable in the model.
Gene metadata such as the gene symbol and the chromosome can be added for an easier interpretation of the obtained statistics to the results table, which is shown below this lines.
# Get gene info...
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[, 1], stringsAsFactors = FALSE)
# and add it to the results
fit$genes <- genesmd
tt <- topTable(fit, coef = 2, n = Inf)
head(tt, n = 20)
chr symbol logFC AveExpr t P.Value adj.P.Val
171482 chrX FAM9A 6.856160 1.477480 39.26673 1.888897e-34 2.242498e-30
51569 chr13 UFM1 5.871020 3.035598 35.50838 1.092477e-32 6.484941e-29
23217 chr19 ZFR2 5.161345 6.564764 33.75968 8.297272e-32 3.283507e-28
2261 chr4 FGFR3 4.340034 4.461243 32.83373 2.526705e-31 7.499260e-28
27316 chrX RBMX 4.510505 3.627725 31.49972 1.324366e-30 2.913015e-27
6406 chr20 SEMG1 2.593916 4.942304 31.27703 1.757285e-30 2.913015e-27
284352 chr19 PPP1R37 5.460151 1.390301 31.37636 1.548654e-30 2.913015e-27
2079 chr14 ERH 4.296390 1.320862 31.14923 2.068708e-30 2.913015e-27
54536 chr10 EXOC6 4.827975 2.127249 31.09821 2.208316e-30 2.913015e-27
100616453 chr11 MIR4687 3.547233 5.839996 30.63877 3.994001e-30 4.343544e-27
9263 chr7 STK17A 5.560139 7.124666 30.63291 4.024510e-30 4.343544e-27
11252 chr22 PACSIN2 4.993374 1.672498 30.16299 7.441565e-30 7.362188e-27
81556 chr15 VWA9 4.719041 2.080915 30.02847 8.887556e-30 8.116389e-27
222546 chr6 RFX6 4.807681 2.002138 29.93165 1.010385e-29 8.568064e-27
51804 chr14 SIX4 3.237016 4.291469 29.80521 1.195294e-29 9.460352e-27
5782 chr7 PTPN12 3.302445 8.759576 29.68308 1.406851e-29 1.043884e-26
117156 chr5 SCGB3A2 4.533204 6.802293 29.56743 1.642521e-29 1.147059e-26
158062 chr9 LCN6 7.986699 6.999330 29.26947 2.454208e-29 1.618687e-26
29889 chr1 GNL2 5.919431 1.866216 29.20485 2.678874e-29 1.673874e-26
692205 chr2 SNORD89 4.627202 4.105932 29.12152 2.999938e-29 1.780763e-26
B
171482 68.07999
51569 64.28491
23217 62.53609
2261 61.37201
27316 59.68472
6406 59.49557
284352 59.42536
2079 59.09763
54536 59.08835
100616453 58.67169
9263 58.66579
11252 57.89290
81556 57.73250
222546 57.60020
51804 57.57738
5782 57.38661
117156 57.25347
158062 56.85789
29889 56.65382
692205 56.62971
The results table includes relevant statistics such as raw and adjusted p-values. In the case of the later ones, it can be seen some of their values are pretty low, which means that the result is significant. The number of differentially expressed genes under the 0.01 FDR cuttoff can be checked by applying a filter to this table.
signDEgenes <- rownames(tt)[tt$adj.P.Val < FDRcutoff]
length(signDEgenes)
[1] 9137
The differential expression analysis results in 9137 significantly differentially expressed genes. Those can still be filtered by a log fold change cutoff of 2 and -2, with the following chromosome distribution.
tt.sign <- tt[tt$adj.P.Val < FDRcutoff,]
DEgenes <- rownames(tt.sign)[abs(tt.sign$logFC) > 2]
length(DEgenes)
[1] 1451
saveRDS(DEgenes, file.path("results", "DEgenes.rds"))
So we are finally calling differentially expressed to 1451 genes. As we added the chromosome information to the table of results, the chromosome distribution of these genes can be obtained.
#sort(table(tt$chr[tt$adj.P.Val < FDRcutoff]), decreasing = TRUE)
sort(table(tt$chr[abs(tt$logFC) > 2]), decreasing = TRUE)
chr1 chr19 chr11
155 99 94
chr2 chr17 chr12
90 87 83
chr9 chr16 chr5
71 68 66
chr4 chr3 chr10
64 62 61
chrX chr6 chr7
61 60 60
chr20 chr15 chr14
47 44 41
chr8 chr22 chr13
41 30 27
chr21 chr18 chr1_KI270766v1_alt
23 16 1
#plot(sort(table(tt$chr[tt$adj.P.Val < FDRcutoff & !grepl(".*[A|v][1]*$", tt$chr, perl=TRUE)]), decreasing = TRUE), main = "DE genes chromosome distribution", ylab = "Number of DE genes", xlab = "Chromosome", cex.axis = 0.7)
ttc <- tt.sign
ttc$chr <- sub( "^chr([X|Y|0-9]+.*)", "\\1", ttc$chr, perl = T)
ttc$chr <- sub( "(.*)_.*_.*$", "\\1A", ttc$chr, perl = T)
plot(sort(table(ttc$chr[abs(ttc$logFC) > 2 & !grepl(".*[A|v][1]*$", ttc$chr, perl=TRUE)]), decreasing = TRUE), main = "DE genes chromosome distribution", ylab = "Number of DE genes", xlab = "Chromosome", cex.axis = 0.7)
Figure 10: Chromosome distribution of the differentially expressed genes
The distribution is shown in Figure 10, and it does not correspond to a distribution based on the chromosome size, as some chromosomes are located higher in the ranking than what would be expected by its size. The chromosome Y does not have any differentially expressed gene.
The distribution of raw p-values and the QQ plot moderated t-statistics, displayed in Figure 11, can be used as a diagnosis of the tests performed.
par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit$t[, 2], df = fit$df.prior + fit$df.residual, main = "", pch = ".", cex = 3)
abline(0, 1, lwd = 2)
Figure 11: Raw p-values distribution and QQ plot of the moderated t-statistics for DE analysis
Under the null hypothesis, which is that most genes are not differentially expressed, the raw p-value distribution should be uniform with a peak at low p-values for the truly DE genes. This coincides with the observed plot.
Regarding the QQ plot, it represents the quantiles of the data sample against the theoretical distribution they should follow. The diagonal represents the null hypothesis. Most genes are off-diagonal, indicating differential expression. The fact tha most of the genes are far from the null hypothesis is a measure of the significance of the results.
The differential expression results are diagnosed with volcano and MA plots, in which the top 10 differentially expressed genes are highlighted as shown in Figure 12. Volcano plots are scatter plots that summarize a measure of significance, such as the p-values (which can be in logarithmic scale as the plot below), and the log-fold change, which represents the measure of differential expression. Significant results will appear towards the plot, while fold changes increase to right for upregulated genes and to left for downregulated genes.
The top differentially expressed genes are tagged for both plots.
top <- order(fit$lods[, 2], decreasing = TRUE)[1:10]
fit$genes$symbol[top]
[1] "FAM9A" "UFM1" "ZFR2" "FGFR3" "RBMX" "SEMG1" "PPP1R37"
[8] "ERH" "EXOC6" "MIR4687"
par(mfrow = c(1, 2), mar = c(4, 5, 3, 2))
volcanoplot(fit, coef = 2, highlight = 10, names = fit$genes$symbol, main = "Volcano Plot")
limma::plotMA(fit, coef = 2, status = rownames(fit$lods) %in% signDEgenes, legend = FALSE,
main = "MA Plot", hl.pch = 46, hl.cex = 4, bg.pch = 46, bg.cex = 3, las = 1)
text(fit$Amean[top], fit$coef[top, 2], fit$genes$symbol[top], cex = 0.5, pos = 4)
Figure 12: Volcano and MA plots for DE analysis without SVA
Both the volcano and MA plot show that the results do not show problems such as unexpected trends.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GO.db_3.7.0 GOstats_2.48.0
[3] Category_2.48.1 Matrix_1.2-17
[5] GSVA_1.30.0 GSVAdata_1.18.0
[7] hgu95a.db_3.2.3 org.Hs.eg.db_3.7.0
[9] GSEABase_1.44.0 graph_1.60.0
[11] sva_3.30.1 genefilter_1.64.0
[13] mgcv_1.8-28 nlme_3.1-140
[15] geneplotter_1.60.0 annotate_1.60.1
[17] XML_3.98-1.20 AnnotationDbi_1.44.0
[19] lattice_0.20-38 edgeR_3.24.3
[21] limma_3.38.3 SummarizedExperiment_1.12.0
[23] DelayedArray_0.8.0 BiocParallel_1.16.6
[25] matrixStats_0.54.0 Biobase_2.42.0
[27] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[29] IRanges_2.16.0 S4Vectors_0.20.1
[31] BiocGenerics_0.28.0 knitr_1.23
[33] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 locfit_1.5-9.1 digest_0.6.19
[4] mime_0.7 R6_2.4.0 RSQLite_2.1.1
[7] evaluate_0.14 highr_0.8 zlibbioc_1.28.0
[10] Rgraphviz_2.26.0 blob_1.1.1 rmarkdown_1.13
[13] shinythemes_1.1.2 splines_3.5.3 stringr_1.4.0
[16] RCurl_1.95-4.12 bit_1.1-14 shiny_1.3.2
[19] compiler_3.5.3 httpuv_1.5.1 xfun_0.7
[22] pkgconfig_2.0.2 htmltools_0.3.6 GenomeInfoDbData_1.2.0
[25] bookdown_0.11 codetools_0.2-16 AnnotationForge_1.24.0
[28] later_0.8.0 bitops_1.0-6 grid_3.5.3
[31] RBGL_1.58.2 xtable_1.8-4 DBI_1.0.0
[34] magrittr_1.5 stringi_1.4.3 XVector_0.22.0
[37] promises_1.0.1 RColorBrewer_1.1-2 tools_3.5.3
[40] bit64_0.9-7 survival_2.44-1.1 yaml_2.2.0
[43] BiocManager_1.30.4 memoise_1.1.0
In addition to identifying differential expressed genes by adjusting surrogate variables in the differential expression analysis, a gene set enrichment analysis (GSEA) was also completed. This procedure provided more sensitivity in identifying gene expression changes. This analysis assessed how genes are behaving differently between two phenotypic states. The phenotypic states in this study were tumor and normal samples. The analysis calculated an enrichment score for each gene set. This score provided information on the changes in gene expression by inidividual genes in the gene set.
The first step in running a GSEA was to select a collection of gene sets from the MSigDB gene set collections provided by the Broad Institute. The C2 collection contained 3272 computational gene sets made up of cancer-oriented microarray data. This collection was well-curated and large, which made it an ideal candidate to use in the analysis.
library(GSEABase)
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11872
library(GSVAdata)
data(c2BroadSets)
c2BroadSets
GeneSetCollection
names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
length(c2BroadSets)
[1] 3272
After the collection was selected for the GSEA, the next step was to map the identifiers from the 3272 C2 gene sets to the dataset being analyzed.
gsc <- GeneSetCollection(c(c2BroadSets))
# Map identifiers
gsc <- mapIdentifiers(gsc, AnnoOrEntrezIdentifier(metadata(se.filt)$annotation))
gsc
GeneSetCollection
names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
A matrix was created to check if the identifiers were properly mapped, or matched to the corresponding identifier in the data set being analyzed.
Im <- incidence(gsc)
dim(Im)
[1] 3272 29340
Im[1:2, 1:10]
5167 100288400 338328 388 10631 440387
NAKAMURA_CANCER_MICROENVIRONMENT_UP 1 1 1 1 1 1
NAKAMURA_CANCER_MICROENVIRONMENT_DN 0 0 0 0 0 0
54910 10136 3630 6662
NAKAMURA_CANCER_MICROENVIRONMENT_UP 1 1 1 1
NAKAMURA_CANCER_MICROENVIRONMENT_DN 0 0 0 0
Im <- Im[, colnames(Im) %in% rownames(se.filt)]
dim(Im)
[1] 3272 9963
Since the genes from the data set being analyzed are the only genes of interest, all genes that were not a part of this data set, denoted by a “0”, were discarded.
se.filtgsea <- se.filt[colnames(Im), ]
dge.filtgsea <- dge.filt[colnames(Im), ]
This left 11872 genes to be analyzed among 100 gene sets from the collection. With the data ready, a SVA was completed without calling any gene differentially expressed. The mean-variance was the same from the previous differential expression analysis so the plot was not displayed.
patientid <- substr(colnames(se.filtgsea), 9, 12)
mod <- model.matrix(~type + patientid, data = colData(se.filtgsea))
mod0 <- model.matrix(~patientid, data = colData(se.filtgsea))
sv <- sva(assays(se.filtgsea)$logCPM, mod, mod0)
Number of significant surrogate variables is: 12
Iteration (out of 5 ):1 2 3 4 5
# Add surrogates to the model
len <- length(colnames(mod))
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:len], paste0("SV", 1:sv$n))
v <- voom(dge.filt, mod)
fit <- lmFit(v, mod)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
With the t-statistics avaialble, the z-scores were then calculated to see if a shift in gene expression within a gene set was present. First, a filter was set that required the gene sets to have a minimum size of 5 genes.
Im <- Im[rowSums(Im) >= 5, ]
dim(Im)
[1] 2760 9963
The t-statistics were then stored in a vector in order of the incidence matrix. The z-score was then calculated and sorted by the absolute score. The z-score of 5 indicated that this gene set was about 5 units above the mean. A low z-score suggested the genes in the gene set were not differentially expressed.
tGSgenes <- tt[match(colnames(Im), rownames(tt)), "t"]
length(tGSgenes)
[1] 9963
head(tGSgenes)
[1] 6.340600 4.823735 2.478672 -15.082266 3.014485 5.436968
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im))
length(zS)
[1] 2760
head(zS)
NAKAMURA_CANCER_MICROENVIRONMENT_UP NAKAMURA_CANCER_MICROENVIRONMENT_DN
-3.668597 -3.829943
WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN
-2.733001 -7.887280
WINTER_HYPOXIA_UP WINTER_HYPOXIA_DN
3.749312 9.144136
rnkGS <- sort(abs(zS), decreasing = TRUE)
head(rnkGS)
WHITEFORD_PEDIATRIC_CANCER_MARKERS
43.80714
GEORGES_TARGETS_OF_MIR192_AND_MIR215
41.32740
DODD_NASOPHARYNGEAL_CARCINOMA_DN
40.30265
PUJANA_XPRSS_INT_NETWORK
39.89832
KEGG_CELL_CYCLE
39.01311
SPIELMAN_LYMPHOBLAST_EUROPEAN_VS_ASIAN_DN
38.80339
A one sample z-test was calculated and a conservative multiple testing adjustment was administered to see if any gene sets were candidates for differentially expressed genes. The reason for completing a multiple testing adjustment was due to gene set overlaps.
pv <- pmin(pnorm(zS), 1 - pnorm(zS))
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.01)]
length(DEgs)
[1] 2313
As shown, 2313 gene sets were differentially expressed.
In addition to the GSEA, a gene set variation analysis (GSVA) was performed. This differs from the standard GSEA by utilizing a gene set by sample matrix rather than a gene by sample matrix. This difference allowed for the pathway enrichment for each individual sample to be analyzed.
In order to do this, a matrix was created with the number of gene sets and an enrichment score that indicates sample-wise gene-level summaries of expression. Since gene sets with a small amount of genes don’t provide much information, a filter was set to remove those gene sets with 5 or less genes.
library(GSVA)
GSexpr <- gsva(assays(se.filt)$logCPM, gsc, min.sz=5, max.sz=300, verbose=FALSE)
dim(GSexpr)
[1] 2698 100
With the data ready, a SVA was completed without calling any gene differentially expressed.
mod <- model.matrix(~se.filt$type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~ patientid, data = colData(se.filt))
svaobj <- sva(GSexpr, mod, mod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5
modSVs <- cbind(mod, svaobj$sv)
corfit <- duplicateCorrelation(GSexpr, modSVs, block = se.filt$cellline)
fit <- lmFit(GSexpr, modSVs)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
DEgs <- rownames(tt[tt$adj.P.Val < 0.01, , drop = FALSE])
length(DEgs)
[1] 1777
The 1777 gene sets that are differentially expressed at 1% FDR are shown in red in Figure 13.
Figure 13: Volcano and MA plots for GVSA
The next step was to complete a gene ontology analysis. This procedure was completed to see if any differentially expressed genes were associated with certain biological processes defined by the Gene Ontology system.
First, a parameter object was built that specified the gene universe of interest, the set of 635 DE genes, the ontology, and other information. The ontology selected was “BP”, which matched genes to the Biological Processes associated with the GO Terms.
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11872
library(GOstats)
params <- new("GOHyperGParams", geneIds=DEgenes, universeGeneIds=geneUniverse,
annotation="org.Hs.eg.db", ontology="BP",
pvalueCutoff=0.01, testDirection="over")
In gene ontology analysis, GO terms can sometime overlap. Therefore, a conditional test was chosen. An Odds Ratio of “Inf” indicated that all genes within a gene set were differentially expressed.
conditional(params) <- TRUE
hgOverCond <- hyperGTest(params)
hgOverCond
Gene to GO BP Conditional test for over-representation
7672 GO BP ids tested (44 have p < 0.01)
Selected gene set size: 1242
Gene universe size: 9840
Annotation package: org.Hs.eg
goresults <- summary(hgOverCond)
head(goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0007189 0.0005206625 2.638730 9.2140244 20 73
2 GO:0002011 0.0006846083 4.034760 3.7865854 11 30
3 GO:0010876 0.0007214380 1.821679 26.3798780 43 209
4 GO:0006821 0.0010321534 2.619485 8.3304878 18 66
5 GO:0007172 0.0011364481 27.777060 0.6310976 4 5
6 GO:0035627 0.0011364481 27.777060 0.6310976 4 5
Term
1 adenylate cyclase-activating G protein-coupled receptor signaling pathway
2 morphogenesis of an epithelial sheet
3 lipid localization
4 chloride transport
5 signal complex assembly
6 ceramide transport
Since signficiantly enriched gene sets formed by a few genes or a lot of genes are not very reliable, a filter was added so those sets with less than or equal to 5 genes or more than 250 genes were removed. In addition, a filter was set on the Odds Ratio to find any above 1.5, which was a 50% increase over the null Odds Ratio of 1. The table was than ordered by Odds Ratio.
goresults <- goresults[goresults$Size >= 5 & goresults$Size <= 250 & goresults$OddsRatio >= 1.5, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]
goresults
GOBPID Pvalue OddsRatio ExpCount Count Size
5 GO:0007172 0.0011364481 27.777060 0.6310976 4 5
6 GO:0035627 0.0011364481 27.777060 0.6310976 4 5
7 GO:0046541 0.0011364481 27.777060 0.6310976 4 5
16 GO:0015886 0.0030676261 13.886914 0.7573171 4 6
27 GO:0001522 0.0064437510 9.256866 0.8835366 4 7
28 GO:0022010 0.0064437510 9.256866 0.8835366 4 7
29 GO:0038063 0.0064437510 9.256866 0.8835366 4 7
30 GO:0055064 0.0064437510 9.256866 0.8835366 4 7
20 GO:0045737 0.0046263593 6.946645 1.2621951 5 10
40 GO:0070875 0.0076078630 5.788197 1.3884146 5 11
33 GO:0048025 0.0072588377 4.632686 1.8932927 6 15
34 GO:1904754 0.0072588377 4.632686 1.8932927 6 15
14 GO:0006376 0.0028050015 4.281262 2.6506098 8 21
26 GO:0019369 0.0063707466 4.055466 2.3981707 7 19
2 GO:0002011 0.0006846083 4.034760 3.7865854 11 30
41 GO:0010171 0.0087513733 3.743071 2.5243902 7 20
23 GO:0003416 0.0053654763 3.709562 2.9030488 8 23
13 GO:1990573 0.0022246253 3.331508 4.2914634 11 34
25 GO:0031055 0.0059093774 3.026186 4.1652439 10 33
43 GO:0008333 0.0095090718 2.981230 3.7865854 9 30
21 GO:0034508 0.0046554698 2.946073 4.6701220 11 37
42 GO:0006336 0.0092703829 2.783442 4.4176829 10 35
1 GO:0007189 0.0005206625 2.638730 9.2140244 20 73
4 GO:0006821 0.0010321534 2.619485 8.3304878 18 66
18 GO:0098661 0.0042154616 2.374371 7.9518293 16 63
8 GO:0007187 0.0014057171 2.011321 16.4085366 29 130
22 GO:0042180 0.0047041428 1.950281 13.8841463 24 110
17 GO:0019935 0.0041579914 1.913764 15.2725610 26 121
3 GO:0010876 0.0007214380 1.821679 26.3798780 43 209
9 GO:0006631 0.0020002701 1.777542 23.7292683 38 188
15 GO:0030198 0.0029433800 1.730552 24.2341463 38 192
19 GO:0051222 0.0043438565 1.636833 28.0207317 42 222
32 GO:0042391 0.0071416215 1.554687 31.3024390 45 248
Term
5 signal complex assembly
6 ceramide transport
7 saliva secretion
16 heme transport
27 pseudouridine synthesis
28 central nervous system myelination
29 collagen-activated tyrosine kinase receptor signaling pathway
30 chloride ion homeostasis
20 positive regulation of cyclin-dependent protein serine/threonine kinase activity
40 positive regulation of glycogen metabolic process
33 negative regulation of mRNA splicing, via spliceosome
34 positive regulation of vascular associated smooth muscle cell migration
14 mRNA splice site selection
26 arachidonic acid metabolic process
2 morphogenesis of an epithelial sheet
41 body morphogenesis
23 endochondral bone growth
13 potassium ion import across plasma membrane
25 chromatin remodeling at centromere
43 endosome to lysosome transport
21 centromere complex assembly
42 DNA replication-independent nucleosome assembly
1 adenylate cyclase-activating G protein-coupled receptor signaling pathway
4 chloride transport
18 inorganic anion transmembrane transport
8 G protein-coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger
22 cellular ketone metabolic process
17 cyclic-nucleotide-mediated signaling
3 lipid localization
9 fatty acid metabolic process
15 extracellular matrix organization
19 positive regulation of protein transport
32 regulation of membrane potential
The table above identified GO terms of interest from the analysis. Some interesting terms related to lung cancer include signal complex assembly, ceramid transport, positive regulation of cyclin-dependent protein serine/threonine kinase actvity, and mRNA splice site selection.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GO.db_3.7.0 GOstats_2.48.0
[3] Category_2.48.1 Matrix_1.2-17
[5] GSVA_1.30.0 GSVAdata_1.18.0
[7] hgu95a.db_3.2.3 org.Hs.eg.db_3.7.0
[9] GSEABase_1.44.0 graph_1.60.0
[11] sva_3.30.1 genefilter_1.64.0
[13] mgcv_1.8-28 nlme_3.1-140
[15] geneplotter_1.60.0 annotate_1.60.1
[17] XML_3.98-1.20 AnnotationDbi_1.44.0
[19] lattice_0.20-38 edgeR_3.24.3
[21] limma_3.38.3 SummarizedExperiment_1.12.0
[23] DelayedArray_0.8.0 BiocParallel_1.16.6
[25] matrixStats_0.54.0 Biobase_2.42.0
[27] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[29] IRanges_2.16.0 S4Vectors_0.20.1
[31] BiocGenerics_0.28.0 knitr_1.23
[33] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.5.3 shiny_1.3.2
[4] statmod_1.4.32 BiocManager_1.30.4 highr_0.8
[7] RBGL_1.58.2 blob_1.1.1 GenomeInfoDbData_1.2.0
[10] yaml_2.2.0 RSQLite_2.1.1 digest_0.6.19
[13] RColorBrewer_1.1-2 promises_1.0.1 XVector_0.22.0
[16] htmltools_0.3.6 httpuv_1.5.1 pkgconfig_2.0.2
[19] bookdown_0.11 zlibbioc_1.28.0 xtable_1.8-4
[22] later_0.8.0 survival_2.44-1.1 magrittr_1.5
[25] mime_0.7 memoise_1.1.0 evaluate_0.14
[28] tools_3.5.3 stringr_1.4.0 locfit_1.5-9.1
[31] compiler_3.5.3 grid_3.5.3 RCurl_1.95-4.12
[34] AnnotationForge_1.24.0 bitops_1.0-6 rmarkdown_1.13
[37] codetools_0.2-16 DBI_1.0.0 R6_2.4.0
[40] bit_1.1-14 shinythemes_1.1.2 Rgraphviz_2.26.0
[43] stringi_1.4.3 Rcpp_1.0.1 xfun_0.7